ABSTRACT


This project presents prototypes of a bot-like artificial intelligence built on social media
data from Twitter. With a basic program, simple algorithms can learn how people use emojis and
for what purposes emoji combinations are used, and the result is an algorithm that can analyze
basic emojis in future projects. Today, rather than programming artificial intelligence
separately for each task, it is more effective and logical to design an algorithm that defines
how the program learns and to let it learn independently from the data given to it. In this
project, I designed an algorithm centered on k-NN, one of the most basic learning algorithms
of artificial intelligence, and organized a data set to show the effect of the design. As the
source of this data I chose social media posts, specifically Twitter. People’s use of emojis
while writing on different subjects can give interesting results. An individual can share
fossils of sea creatures that lived 505 million years ago, or academic studies on potential
exo-life under the glaciers of the moon Europa, in one post, and share in a very different
style, such as drawings of a favorite game character or birthday celebrations, in another.
Different emoji combinations can occur in the posts on all of these topics. Since it is
difficult even for people to express a subject or a feeling, the data had to be prepared
accordingly for a program. For this purpose, I modified and edited a data set designed in
previous projects, which made it more useful. The program learns from this data set through
the algorithms I describe in the following parts. Before I started programming, I chose to
write the code in Python, which I had used before for work on artificial intelligence, and I
shaped the algorithm. I then designed the algorithm the program needs to read the data set.
This algorithm can access and update the folders in the data set, so that the program reaches
the data required for the next stage. In addition, I created criteria that let the program
choose some emojis randomly and use them as a classification and comparison reference; these
criteria are limited enough to allow us to observe the errors to a certain extent. After
preparing this data, I wrote code so that each tweet is reviewed only once during each
learning session, which avoids counting posts that are repeated within the Twitter data sets.
Finally, I implemented the steps required for the k-NN algorithm and adapted the method to my
program.

The algorithm I use, the k-nearest neighbor algorithm or k-NN, is a learning method
developed by Evelyn Fix and Joseph Hodges in 1951 and is generally explained in statistics
[2]. The logic of the k-NN algorithm is to compare new data with previously seen data. I
planned some experiments to choose the best possible value of the k parameter. The best
choice of k depends on the data; in general, larger k values reduce the effect of noise on
the classification [3], but they also make the boundaries between classes less distinct. The
special case in which the class is predicted to be the class of the closest training sample
(k=1) is called the nearest neighbor algorithm. Here, the joint use of emojis does not fit
into a fixed pattern, so k=1 is not used, in order to keep the learning flexible to a certain
extent, and experiments are run to find the ideal value. k-NN is a type of classification
where the function is only approximated locally, and all calculations are deferred until
function evaluation. Since this algorithm relies on distance for classification, it is a
straightforward and convenient design. It is simple but effective. However, if the features
represent different physical units or come in vastly different scales, normalizing the
training data can improve its accuracy [1-3]. After I edited this code, I added “print”-based
statements that output the results to us.
1. INTRODUCTION


At the beginning of this project, my primary aim was to show the learning mechanisms of
artificial intelligence starting from the most basic ones, and to show how they work by
testing their effect on sentiment analysis with emojis. First, I will explain the k-NN
algorithm and its areas of use. I will also explain machine learning, which is the field in
which k-NN is used, and demonstrate it in this project. One of the problems that machine
learning can solve is classification, which has a wide range of uses. Many problems today can
be framed and solved as classification problems. For example, a program can learn whether a
customer can repay a loan, recognize text, or determine in which cases emojis are used. The
software can update itself independently of a human. Many algorithms have been developed to
solve these problems. This project is based on the k-NN (k-Nearest Neighbors) algorithm. In
the simplest sense, k-NN estimates the class of the vector formed by the independent
variables of the value to be estimated, based on the class in which its nearest neighbors are
concentrated. The k-NN algorithm makes predictions from two basic values. One of them is
distance: the distance of the point to be estimated from the other points is calculated. The
other is k (the number of neighbors): we specify how many nearest neighbors the calculation
uses. The value of k directly affects the result. If k is 1, the probability of overfitting
is very high. If it is too large, the results become very general. For this reason,
estimating the optimum k value remains the main subject of the problem. k-NN is one of the
algorithms used for classification and regression in supervised learning. It is considered
the most straightforward machine learning algorithm. Unlike other supervised learning
algorithms, it does not have a training phase; training and testing are pretty much the same
thing. It is a lazy type of learning. Therefore, k-NN is not an ideal candidate when the
algorithm is required to process a large dataset. But in this project, k-NN is well suited to
show how artificial intelligence can learn emojis quickly and relatively easily and to make
observations on learning.
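
The two values described above, distance and k, can be illustrated with a minimal sketch. This is not the project's own code: the function name knn_predict, the two-feature toy points, and the use of Euclidean distance are assumptions made only for illustration.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort the training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per point, two classes.
train = [
    ((1.0, 1.0), "happy"),
    ((1.2, 0.8), "happy"),
    ((0.9, 1.1), "happy"),
    ((4.0, 4.2), "sad"),
    ((4.1, 3.9), "sad"),
]

print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (3.8, 4.0), k=3))
```

Note that nothing is fitted before the call: all the work happens when a query arrives, which is why k-NN is described as lazy learning.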


k represents the number of nearest neighbors of the unknown point.

We choose a value of k for the algorithm (usually an odd number) to predict the results.
In this project, I first edited a data set consisting of a large number of Twitter posts in
which emojis are used for different purposes. After that, I separated the emojis into classes
that are few but sufficient for classification and for seeing the errors. Since we will use
this classification for data set analysis in the future, I made adjustments to it. This was
necessary for the program to learn and classify. Now let me explain the experiment in which
we use the k-NN algorithm a little more. Our project aims to show that a program can learn
something that has not been taught to it before, with a basic and simple algorithm, just by
examining a lot of data on that subject. Through the classification examples shown to it, its
result differs each time, but on average it can be consistent and, more importantly, it can
learn quickly. This program examines the data without the biased view of humans and without
their slow speed. It gives us an idea of which emoji combinations are used in which ways and
which moods they describe. As a result of this study, we will also have an algorithm that can
efficiently analyze and learn how emojis are used for which subjects and moods. We will be
able to use this algorithm and code collection in other projects.

One thing to note is that k-NN is a type of instance-based learning, or lazy learning,
where the function is only approximated locally and all computation is deferred until
classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
For both classification and regression, it may be helpful to weight the contributions of the
neighbors so that nearer neighbors contribute more to the average than more distant ones. For
example, a standard weighting scheme assigns each neighbor a weight of 1/d, where d is the
distance to the neighbor. Neighbors are taken from a set of objects for which the class (for
k-NN classification) or the object property value (for k-NN regression) is known. This can be
thought of as the training set for the algorithm, but no explicit training step is required.
In addition, I can explain the reason for using the k-NN algorithm in this project as
follows:

“If we want to study or make inferences about the common ancestor of all vertebrates on
land, and if we examine the brain of a species that is a member of the superclass Osteichthyes
(bony fish), we can infer the most basic structures of other vertebrate brains that have
developed on top of it, because the brain has developed in a layered way. The innermost, most
central part of the brain of all these living things, including Homo sapiens, basically
resembles a fish brain structure.”

If we apply this example from taxonomy and biology not to the evolution of organic
living things but to the evolution of learning programs, what I mean will be better
understood. Suppose our goal is to understand the variables of communication, and related
methods of communication such as emoji, with which future super-AIs interact. In that case,
we can do this through research and testing on their most basic, primitive ancestors.

Suppose we want to understand the potential of complex algorithms such as GPT-3, which I
will discuss in the following parts and which is perhaps one of the most important steps in
artificial intelligence. In that case, we should examine and test the k-NN algorithm, which
is a primitive ancestor of these advanced algorithms. With this in mind, I did experiments on
k-NN.
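
The 1/d weighting scheme mentioned earlier in this section can be sketched as follows. This is an illustrative sketch only; the function name weighted_vote and the sample distances are assumptions, not part of the project code.

```python
from collections import defaultdict


def weighted_vote(neighbors):
    """Return the label with the largest total 1/d weight.

    `neighbors` is a list of (distance, label) pairs for the k nearest points.
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        # Guard against division by zero when a neighbor coincides with the query.
        totals[label] += 1.0 / max(distance, 1e-12)
    return max(totals, key=totals.get)


# Hypothetical neighbors: one very close "A" and two more distant "B" points.
neighbors = [(0.5, "A"), (2.0, "B"), (2.5, "B")]
print(weighted_vote(neighbors))
```

With these values a plain majority vote would pick "B" (two votes against one), while the weighted vote favors the single close neighbor, because 1/0.5 = 2.0 exceeds 1/2.0 + 1/2.5 = 0.9.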
2. METHODOLOGIES


2.1 Dataset and What the Program Can Potentially Learn

In this part, the basic arrangement and planning of the dataset are carried out. Why are
we writing this program? What can this algorithm tell us about practical classification by a
program? This part was spent trying to answer such questions. After giving ideas about what
we can learn from Twitter and editing the data set, I also made the necessary plans to make
the self-learning ability of this algorithm usable in other projects.

So, what is the connection between k-NN’s mechanism and “deep learning”, which is widely
talked about today? In this regard, I need to point out that the k-NN algorithm shows certain
basic similarities with programs that work with the neural network method, which marks a
significant turning point today. I did a study on this subject in 2021 and, while my
knowledge was fresh, used that work on another subject. In k-NN, the program learns by
comparing the received data with each other in different ways and updating its own
information accordingly. Similarly, deep learning can be explained as the self-updating of an
algorithm that is fed with as many data sets as possible and makes comparisons [8,12]. As in
these operations, k-NN works by comparing the nearest points and their numerical values with
various methods and placing them in the same group, so it is necessary to test values of k
and to have a dataset to compare against [11].

In this regard, besides a primitive program like k-NN, it is helpful to mention the GPT-3
algorithm, a brilliant example of its kind that probably represents a turning point that will
go down in history. Suppose we have the personality analysis of a fictional character and the
mathematical ratio of the words and sentences he uses. In that case, we can copy the
personality of that character. When I tested this in a kind of chatbot, the botmake.io
software, I found that when I used the pattern I chose for “friend-bot”, the result could be
similar to that character. While I was experimenting with this, the company OpenAI introduced
the GPT-3 algorithm in 2020. Argentine computer engineer Manuel Araoz, who took part in the
beta of this algorithm, used GPT-3 to try out an artificial intelligence with the personality
and knowledge of Albert Einstein. This algorithm, which could access all of Albert Einstein’s
articles and interviews, produced a wonderful and shocking interview with an almost exact
copy [16]. I concluded that if such consistency can be obtained with sufficient data about a
real person, then in the same way, with sufficient data and the already prepared personality
report that I had obtained, this program would give me an opportunity to test my claim. I
applied to the beta of the program in 2021 to be able to experience this and was accepted.
The data I obtained were far beyond my expectations, and the attitudes of the artificial
intelligence produced were largely similar to the character, from his knowledge of his own
fictional universe to the word patterns he used [13]. GPT-3 has over 100 times as many
parameters as its previous version, GPT-2. In other words, GPT-3 can process the 410 billion
pieces of information it collects with 175 billion connections. While the computing power of
the human brain costs us three meals and 8 hours of sleep, OpenAI spent $4.6 million for
GPT-3 to understand how languages work and are structured. The large number of parameters and
the enormous amount of data used to train this model are also important, as they virtually
eliminate the need for fine-tuning. GPT-3 uses semantic analytics to learn how to construct
language structures such as sentences. It not only examines words and their meanings but also
develops an understanding of how the use of words differs depending on the other words used
in the text. This is a form of machine learning called unsupervised learning, because the
training data do not contain any information about the “right” or “wrong” answer, as in
supervised learning. All the information it needs to calculate the probability that its
output is what the user wants is gathered from the training texts. This is done by examining
the use of words and phrases, then breaking them down and trying to reconstruct them. In
other words, as in the k-NN algorithm, it learns the data presented to it by comparing it;
the difference is that it can do this efficiently and by updating itself in different ways
[4-7]. In this case, if we can produce a relatively successful algorithm with k-NN about the
logic of emojis in social media and the situations in which their combinations are used, our
positive inferences about the potential of programs more advanced than k-NN will gain
strength. Artificial intelligence built on learning algorithms will change the world, but
GPT-3 is only an early step on that trajectory. If we compare GPT-3 to a primitive species at
this step, we can say that k-NN is a prokaryotic bacterium in the ecosystem. At certain
stages, I continued my research from the very foundation of this level.
2.2 Preparing the Dataset Code

This section covers the operations required to classify and arrange the emojis and to
compare what emojis are used for in the dataset. These operations made it possible to
construct the first drafts and outlines of the code that follows. I researched the use of
emojis and saw that even if we assign emojis to specific classes, people can actually use
those emojis outside of that class. That is why I made a more efficient classification with
additional classes such as party, animals, and nature, alongside classes that express
emotions such as happy, sad, angry, and surprised. I did not add too many classes, because
the primary purpose here is to observe which emotions people most often use emojis to
describe and to monitor how the program learns about it.

The three distance functions mentioned above come into play at the next stage.
Comparisons are made according to the criteria I specify in the following sections, and each
item is placed in the closest class. Finally, the software labels the data according to that
criterion using the data it receives, that is, it “classifies” it [1-3].

While writing this algorithm, I used the Python programming language, which I first
learned at school and have worked with the most, together with its “scikit-learn” module.
“scikit-learn” is a machine learning module built on the NumPy, SciPy, and Matplotlib
modules. It provides simple and efficient tools for classification, regression, clustering,
dimensionality reduction, model selection, and data preprocessing, which are also used in
data analysis.

I had to arrange and classify the data set, which I had obtained and studied in some of
my previous research, so that the algorithm could use it. In this project, the posts shared
by various people let us learn, from the use and variety of emojis, which emoji combinations
indicate which emotional state and how many people share it. In fact, this algorithm can show
results that we would not usually think of. For example, our program can discover that people
use emoji combinations for happiness that I would never use myself. I created the “prepare
dataset” code to allow the software itself to observe, classify, and group the emojis from
the dataset.


In the code, I first worked on the import part:

import glob, os
import codecs

Then I prepared the variables and data containers beforehand:

allTweets = set([])      # every distinct tweet seen so far
emojis = {}              # emoji -> count
emojiPairs = {}          # concatenated emoji pair -> count
normalizedPairs = {}     # concatenated emoji pair -> normalized value


I prepared the code that lets the program itself set up the emoji data from the folder
where the data is stored, one text file per emoji:

def initializeEmojis():
    # The dataset folder sits next to this script (Windows-style path).
    dir_path = os.path.dirname(os.path.realpath(__file__))+"\\dataset"
    os.chdir(dir_path)
    for file in glob.glob("*.txt"):
        # The file name without its extension is the emoji; start its count at 0.
        emojis[file.replace(".txt","")] = 0


Thanks to glob, the program finds every file in the dataset folder that is stored as a
text file, so that the data in those files can be imported and edited.


def initializeEmojiPairs():
    for e1 in emojis:
        for e2 in emojis:
            if e1 == e2:
                continue
            # Skip the pair if it is already stored in either order.
            if (e1+e2) in emojiPairs:
                continue
            if (e2+e1) in emojiPairs:
                continue
            emojiPairs[e1+e2] = 0
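
As an aside, the same set of unordered pairs can be produced with the standard library. The sketch below is an alternative formulation, not the code used in the project; the function name unordered_pairs and the three sample emojis are assumptions for illustration.

```python
from itertools import combinations


def unordered_pairs(emojis):
    """Return a dict with one zero-initialized counter per unordered emoji pair."""
    # combinations() yields each pair once, so no reverse-order check is needed.
    return {a + b: 0 for a, b in combinations(emojis, 2)}


pairs = unordered_pairs(["😀", "😢", "🎂"])
print(len(pairs))  # 3 emojis give 3 unordered pairs
```

Like the original function, this keys each pair by the concatenation of its two emojis, so it never stores both an e1+e2 and an e2+e1 entry.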


We run the functions we have defined in order, which prepares the data for the k-NN code
that follows:

initializeEmojis()
initializeEmojiPairs()
scanDataset()
normalizePairs()

# Save the normalized pair values one level above the dataset folder.
# The with block closes the file, so no explicit f.close() is needed.
with codecs.open("../node_distances.txt", 'w', encoding="utf8") as f:
    f.write(str(normalizedPairs))


As a result of these processes, with the data we gave it, the program performed the
classification and created a dataset that we could examine.
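
The bodies of scanDataset() and normalizePairs() are not reproduced in this section. The sketch below shows one way such a step could work on in-memory data; it is an assumption, not the project's implementation. In particular, the names count_pairs and normalize_pairs, the sample tweets, and the choice of turning co-occurrence counts into distances as 1 - count/max are all hypothetical.

```python
from itertools import combinations


def count_pairs(tweets, emojis):
    """Count in how many distinct tweets each unordered emoji pair co-occurs."""
    counts = {a + b: 0 for a, b in combinations(emojis, 2)}
    for tweet in set(tweets):  # each distinct tweet is reviewed only once
        for a, b in combinations(emojis, 2):
            if a in tweet and b in tweet:
                counts[a + b] += 1
    return counts


def normalize_pairs(counts):
    """Turn co-occurrence counts into distances in [0, 1]; frequent pairs are closer."""
    largest = max(counts.values(), default=0)
    if largest == 0:
        return {pair: 1.0 for pair in counts}
    return {pair: 1.0 - n / largest for pair, n in counts.items()}


# Hypothetical tweets; the first and last are duplicates and count once.
tweets = ["happy birthday 🎂🎉", "party time 🎉🎂", "so sad 😢", "happy birthday 🎂🎉"]
distances = normalize_pairs(count_pairs(tweets, ["🎂", "🎉", "😢"]))
print(distances)
```

Under this assumed definition, emojis that often appear together get a small distance, which is the kind of value a nearest-neighbor step can then compare.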


During these processes, the emojis obtained from Twitter are analyzed. These emojis are
shown in Figure 1.

Figure 1. The emojis used in the project
As a result of these processes, our program made the classification shown in Figure 2
and Figure 3. The images show how the program updates itself in text format and how it
reaches this data. The code described above handles access to these folders.

Figure 2. Screenshot of the data set’s text files in the library

Figure 3. Contents of the cake emoji text file in the data set, showing the tweets


As a result of the research done here, we have a basic idea about people’s use of emojis,
and we have an algorithm that can examine the data set I edited and update itself. This code
is necessary for the algorithms in the next section: the program will use what it has
examined here for the k-NN algorithm.


2.3 k-NN Code 

This is the second part of the code I designed, and this part is more about learning. The
k-NN algorithm is a good and productive start, as it is the core of much more comprehensive
algorithms, that is, of advanced learning algorithms along evolutionary lines. Using the
comparisons in this system, the algorithm can “learn” which emojis can be used in similar
situations to which other emojis. That is what I wanted to do in this project. For example,
we may want to analyze posts on Twitter and do so efficiently. The proportions and
combinations of emojis in everyday texts are as varied as hair dye colors. Such an analysis
can be helpful on occasion and shows how we learn about emojis. These tools can also be
applied to other uses of emojis, and this work can inspire us to apply the method to more
applications.

To solve the learning problem, that is, the classification problem of emojis, I preferred
the most basic algorithm, k-NN (“k-Nearest Neighbors”), and focused on it at this stage. The
k-NN algorithm makes predictions from two basic values:

Distance: The distance of the point to be estimated from the other points is calculated.
For this, the Minkowski distance function is used.

k (number of neighbors): We specify how many nearest neighbors to use in the calculation.
The value of k directly affects the result. If k is 1, the probability of overfitting is very
high. If it is too large, the results become very general. For this reason, estimating the
optimum k value remains the main subject of the problem. The graph below shows the importance
of the k value very nicely. If we choose k=3 (the area with the solid line), the
classification algorithm assigns the marked point to the red triangle class. But if we choose
k=5 (the area with the dashed line), the classification algorithm assigns the same point to
the blue square class [1-3,12]. During the test, the classification of emojis by the
artificial intelligence is decided by identifying the most inconsistent ones and averaging
them.
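
The effect described above, where the same point changes class between k=3 and k=5, can be reproduced with a small sketch. The point coordinates, the function names minkowski and knn_label, and the default order p=2 are assumptions chosen for illustration; they are not taken from the project code.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_label(train, query, k, p=2):
    """Majority label among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical layout: two triangles close to the query, three squares a bit farther.
train = [
    ((1.0, 0.0), "triangle"),
    ((0.0, 1.2), "triangle"),
    ((1.5, 0.0), "square"),
    ((0.0, 2.0), "square"),
    ((2.0, 1.0), "square"),
]
query = (0.0, 0.0)

print(knn_label(train, query, k=3))  # two triangles and one square vote
print(knn_label(train, query, k=5))  # two triangles and three squares vote
```

With k=3 the two nearby triangles win the vote; with k=5 all three squares are included and outvote them.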

At this stage, we started with k=3 and tested increasing values of k. As k increases up
to a certain point, the classification becomes more accurate, but after that point the
results become more inconsistent. The same result was obtained in this test.

When k=3, there is a 28% critical error.

When k=4, there is a 22% critical error.

When k=5, there is a 20% critical error.

When k=6, there is a 27% critical error.

When k=8, there is a 38% critical error.
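
Choosing k from these measurements amounts to taking the value with the lowest reported error. A minimal sketch, using only the percentages listed above (the variable names are assumptions):

```python
# Critical error rates (in percent) reported for each tested k.
critical_error = {3: 28, 4: 22, 5: 20, 6: 27, 8: 38}

# Pick the k with the lowest reported error.
best_k = min(critical_error, key=critical_error.get)
print(best_k)  # 5
```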


My code uses these results by making similar comparisons within “if” statements. This is
the primary part of the algorithm I use:

for e1 in emoji_set:
    if e1 in labels:
        continue

    neighbor_clusters = []
    neighbor_emojis = []
    i = k
    for pair in node_distances:
        if e1 in pair:
            e2 = pair.replace(e1,"")
            if not e2 in

In this part of the code, I updated the emojis in the previously determined folders and
used pairs. To implement the concrete level of our data abstraction, Python provides a
compound structure called a tuple, which can be constructed by separating values with commas.
Although not strictly required, parentheses almost always surround tuples. The lists we
compare and use here are as follows:


import codecs

node_distances = {}
clusters = {}
emoji_set = set([])
k = 5

The part of the code I use here is the import and variable determination part at the 
beginning of the program. Here I used “codecs” in the import part. This module defines base 
classes for standard Python codecs (encoders and decoders) and provides access to the 
internal Python codec registry, managing the codec and error handling lookup process. Most 
standard codecs are text encodings, which encode text to bytes, but codecs are provided that 
encode text to text and bytes to bytes. Custom codecs may encode and decode between 
arbitrary types, but some module features are restricted to use specifically with text encodings 
or with codecs that encode to bytes. 

After this part, I added the code that lets us reach our previously arranged dataset:

with codecs.open("node_distances.txt", 'r', encoding="utf8") as f:
    node_distances = eval(f.read())


This code will allow us to access the “distance” updates on the data. 
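
One design note on reading the file back: eval executes whatever Python expression the file contains. When the file only holds a dictionary literal, the standard library function ast.literal_eval can parse it without running code. The sketch below is a suggestion, not the project's code; the function name parse_distances and the sample text are assumptions.

```python
import ast


def parse_distances(text):
    """Parse the text of a Python dict literal into a dict without running code."""
    data = ast.literal_eval(text)
    if not isinstance(data, dict):
        raise ValueError("expected a dictionary literal")
    return data


sample = "{'ab': 0.25, 'ac': 0.75}"  # hypothetical file contents
print(parse_distances(sample))
```

Anything that is not a plain literal, such as a function call, is rejected with an error instead of being executed.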


node_distances = dict(sorted(node_distances.items(),
                             key=lambda item: item[1]))

In this part of the code, we sort node_distances by distance. Here I used the dict
function, which creates dictionaries in Python. It exists in three different forms: it can be
called with keyword arguments, with data of a mapping type, or with an iterable of key-value
pairs. In every case, the result produced is a dictionary.
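
The sort-by-value idiom used for node_distances can be tried on a small dictionary. The keys and values below are hypothetical and serve only to show the resulting order.

```python
distances = {"ab": 0.8, "ac": 0.1, "bc": 0.4}  # hypothetical pair distances

# sorted() orders the (key, value) items by value; dict() keeps that order.
ordered = dict(sorted(distances.items(), key=lambda item: item[1]))
print(list(ordered))  # ['ac', 'bc', 'ab']
```

Because dictionaries preserve insertion order (Python 3.7 and later), iterating over the result visits the closest pairs first.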


In the later parts, we use the keys to access these entries:

with codecs.open("result_clusters.txt", 'w', encoding="utf8") as f:
    wrote_sets = set([])
    for e1 in clusters:
        kume = clusters[e1]  # "kume" means "cluster" in Turkish
        if kume in wrote_sets:
            continue
        wrote_sets.add(kume)
        # Write the class name once, followed by every emoji assigned to it.
        print(kume+": ",end="")
        f.write(kume+": ")
        for e2 in clusters:
            if clusters[e2] == kume:
                print(e2,end="")
                f.write(e2)
        f.write("\n")
        print(" ")


Also, in order to examine and use the data collected in this section, I used this code to
write it to a file, which is more effective than only printing it on the screen. Here I used
the clusters dictionary of the program. Cluster analysis, or clustering, is an unsupervised
machine learning technique that groups unlabeled data. It aims to form clusters or groups
from the data points in a dataset so that there is high intra-cluster similarity and low
inter-cluster similarity.

I use these labels to define the emoji classes:


clusters["@"] = "happy" 
clusters["@"] = "sad" 
clusters["©"] = "angry" 
clusters["(@)"] = "surprized" 
clusters["3"] = "sports" 
clusters['"*@"] = "animals" 
clusters["$2"] = "party" 
clusters[" 9 "] = "nature" 


labels = "OQ" "OO" "Oo", wr" "A" @ "] 
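
The main loop of the k-NN code is only partly reproduced above. As a rough sketch of the general idea, an unlabeled emoji can take the majority class of its k nearest emojis that already have a label. Everything below is an assumption for illustration: the function name label_from_neighbors, the use of tuple keys instead of concatenated strings, and the sample emojis and distances.

```python
from collections import Counter


def label_from_neighbors(emoji, node_distances, clusters, k):
    """Label `emoji` by majority vote among its k nearest already-labeled neighbors.

    `node_distances` maps (emoji_a, emoji_b) tuples to a distance;
    `clusters` maps already-labeled emojis to their class name.
    """
    candidates = []
    for (a, b), distance in node_distances.items():
        if emoji not in (a, b):
            continue
        other = b if a == emoji else a
        if other in clusters:  # only labeled emojis can vote
            candidates.append((distance, clusters[other]))
    if not candidates:
        return None
    nearest = sorted(candidates)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical labeled emojis and pair distances.
clusters = {"😀": "happy", "😊": "happy", "😢": "sad"}
node_distances = {
    ("😁", "😀"): 0.1,
    ("😁", "😊"): 0.2,
    ("😁", "😢"): 0.9,
    ("😀", "😢"): 0.8,
}
print(label_from_neighbors("😁", node_distances, clusters, k=3))
```

Tuple keys avoid the ambiguity of splitting a concatenated string back into two emojis, which matters for emojis made of several code points.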
2.4 Chart Code

The code in this part is designed to run only if a particular Python library is
installed. I would like to point out in advance that it is intended as an arrangement that
makes the algorithm usable in other projects in the future and helps those who are going to
do large projects.

In this part, I aimed to write a final piece of code that graphs all the data, so that
the data obtained from the code in the previous parts becomes usable and analyzable. This
code finalizes the data produced first by the code that organizes the data set and then by
the second code, which uses the k-NN algorithm. I first had to check the previous steps and
make sure they worked properly. After finishing some edits and running these programs, I came
to the end of my experiment on the potential of the k-NN algorithm and of machine learning.
We now have an algorithm with benchmark data about people’s use of emoji, and programs that
use data from this system. One last thing remained: I had to write code to visualize and
organize this data.

In this part, I started my code with the imports that I used in the first code, including
the “codecs” module described above, and added matplotlib for drawing the graph.


import glob, os 
import codecs 
import matplotlib.pyplot as plt 


Then I extended my code to calculate the number and content of the emoji folders that my
code had previously classified:

allTweets = set([])
emojiFreq = {}

clusters = {}
emotionFreq = {}
emotionFreq["angry"] = 0
emotionFreq["sad"] = 0
emotionFreq["mixed"] = 0
emotionFreq["sports"] = 0
emotionFreq["surprized"] = 0
emotionFreq["animals"] = 0
emotionFreq["party"] = 0
emotionFreq["nature"] = 0
emotionFreq["happy"] = 0


Then I continued my code by writing some functions.

with codecs.open("result_clusters.txt", 'r', encoding='utf8') as f:
    Lines = f.readlines()
    count = 0
    for line in Lines:
        # Each line has the form "label: emojis".
        s = line.strip().split(": ")
        clusters[s[0]] = s[1]
        emotionFreq[s[0]] = 0


The first block above builds on the codecs module I mentioned earlier and gives us access
to the result file that was updated by the previous code. f.readlines lets us read the text
file line by line.
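
The parsing step can be tried without the file on disk. The sketch below is an illustrative variant, not the project's code: the function name read_clusters and the sample contents are assumptions, and str.partition is used in place of split so that a line without a ": " separator does not raise an error.

```python
import io


def read_clusters(lines):
    """Map each class label to the string of emojis listed after 'label: '."""
    clusters = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, _, members = line.partition(": ")
        clusters[label] = members
    return clusters


sample = io.StringIO("happy: 😀😊\nsad: 😢\n")  # hypothetical file contents
print(read_clusters(sample.readlines()))
```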


def initializeEmojis():
    # The dataset folder sits next to this script (Windows-style path).
    dir_path = os.path.dirname(os.path.realpath(__file__))+"\\dataset"
    os.chdir(dir_path)
    for file in glob.glob("*.txt"):
        emojiFreq[file.replace(".txt","")] = 0


After that, we prepared the code for the classification of emojis and examined the data 
on this subject. 


def scanDataset():
    dir_path = os.path.dirname(os.path.realpath(__file__))+"\\dataset"
    os.chdir(dir_path)
    for file in glob.glob("*.txt"):
        f = open(dir_path+"\\"+file, encoding="utf8")
        # Rough progress indicator: two digits of len(allTweets)/464681.
        print("%"+str(len(allTweets)/464681)[2:4])
        for line in f:
            # Count each distinct tweet only once.
            if line in allTweets:
                continue
            allTweets.add(line)
            for e in emojiFreq:
                if e in line:
                    emojiFreq[e] += 1


The code here lets us examine the tweets in order and skip repeated ones. Then I coded
the functions that collect the emoji counts.


def calculateEmotionFreq():
    # Add each emoji's count to the class it belongs to.
    for emoji in emojiFreq:
        for label in clusters:
            if emoji in clusters[label]:
                emotionFreq[label] += emojiFreq[emoji]


def calculate2():
    # Assign one emotion per tweet; tweets with emojis from several classes are "mixed".
    for tweet in allTweets:
        tweetEmotion = None
        for emoji in emojiFreq:
            if not emoji in tweet:
                continue
            for label in clusters:
                if not emoji in clusters[label]:
                    continue
                if tweetEmotion == None:
                    tweetEmotion = label
                    continue
                if tweetEmotion != label:
                    tweetEmotion = "mixed"

        if tweetEmotion != None:
            emotionFreq[tweetEmotion] += 1

After arranging these functions, I took care of the last parts of the code so that they
run in order. In this way, I completed the final version of the program.


initializeEmojis() 
scanDataset() 
calculateEmotionFreq() 
calculate2() 


tweetCount = len(allTweets) 


D = emotionFreq 

plt.bar(range(len(D)), list(D.values()), align='center') 
plt.xticks(range(len(D)), list(D.keys())) 

plt.show() 


Working together, these pieces of code turn all the data into a single final output, from
which we can get precise results.


3 CONCLUSION AND DISCUSSION


This project aims to run experiments showing the effect of the k variable and the
importance of the data set in such algorithms, and to draw conclusions from them. I have
prepared output folders and datasets that make it easy for anyone to review each result. As
a result of this process, we have an algorithm arranged to be comfortable for others to
develop and use.

As a result of these studies, we have shown that it is possible to learn in a setting where
the number of possible combinations, as with emojis, is very high. We used the k-NN
algorithm, which can be considered a relatively primitive learning algorithm and an ancestor
of today's artificial intelligence. At the same time, we gained fruitful ideas about the
potential of machine learning, and I came to conclusions about how future programs should be
designed. I observed that accuracy improved at a steady rate as k increased, but the error
rate rose suddenly after k=5, and I adjusted the program accordingly.
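
How the error rate can jump once k grows too large for the data can be illustrated with a toy experiment. The sketch below is not the project's evaluation code: the eight 2-D points, the function names, and the use of leave-one-out testing are all assumptions chosen only to make the effect visible.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_error(data, k):
    """Fraction of points misclassified when each is predicted from the rest."""
    errors = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, point, k) != label:
            errors += 1
    return errors / len(data)


# Hypothetical toy data: two well-separated groups of four points each.
data = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "A"),
        ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B")]

for k in (1, 3, 5, 7):
    print(k, leave_one_out_error(data, k))
```

On this toy set the error stays at zero for k = 1, 3, and 5, and every point is misclassified at k = 7, because each point then has only three neighbors of its own class against four of the other. The threshold depends entirely on the size of the classes, so the numbers say nothing about the emoji data set itself.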

What I want to do here is explain my observations on k in the k-NN algorithm through
different examples, which I think will give a clearer idea of the problems of the k-NN
algorithm and how it can be improved. As I mentioned, k-NN classifies new data according to
its proximity to previously labeled data. So what kind of variability can the variable k
cause here?

For example, if we take k = 3, the distances from the new data point to the old data are
measured, and the closest three are determined. Suppose that, of these three nearest points
in our fictional data, 2 are from class A and 1 is from class B. In this case, the algorithm
decides that the new data point is in class A. This is consistent so far, but specific
problems can arise. As I said before, the quality and quantity of the data set stand out as
an essential factor here [9].
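
The k = 3 case described above can be written out as a short majority vote. The distances and labels below are invented for illustration, and the function name vote is hypothetical. The same list also shows how a different k can change the decision.

```python
from collections import Counter


def vote(distances, k):
    """Return the majority label among the k smallest distances."""
    nearest = sorted(distances)[:k]
    counts = Counter(label for _, label in nearest)
    return counts.most_common(1)[0][0]


# Hypothetical (distance, class) pairs from the new point to five old points.
distances = [(1.2, "A"), (0.8, "B"), (1.5, "A"), (3.9, "B"), (4.4, "B")]

print(vote(distances, 3))  # two of the three nearest are A
print(vote(distances, 5))  # with all five points, B has the majority
```

With k = 3 the new point is placed in class A, as in the text. With k = 5 the two distant B points are also counted, and the decision changes to B.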

When we talk about k-NN in general, the Euclidean method, which has come down to us from
ancient times, is the first distance calculation method that comes to mind. Still, different
techniques such as the Manhattan and Minkowski methods can be used, and standard
implementations of them are accessible to everyone today. It may therefore be possible to
make the program more accurate by using them together, but this may negatively affect the
program's efficiency [10].
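
The three distance measures named above are closely related: the Minkowski distance with p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance. A minimal sketch, with hypothetical function names and sample points:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)


a, b = (0, 0), (3, 4)
print(manhattan(a, b))     # 3 + 4
print(euclidean(a, b))     # square root of 9 + 16
print(minkowski(a, b, 3))  # cube root of 27 + 64
```

Because each measure ranks neighbors differently, switching between them can change which k points are considered closest, and computing several of them for every comparison multiplies the work done per query.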

At this point, I want to focus more on the programs that can learn, of which k-NN is the
basis. GPT-3, which I have mentioned often, is a good place to start. In artificial
intelligence, the technology known as GPT-3 caused great excitement in 2020, and that
excitement continues. Developed by OpenAI, GPT-3 is an artificial intelligence model that is
better than any previous model at creating content with a language structure, whether human
or machine language. But there is some confusion about precisely what it does (and does not
do). First, GPT-3 stands for Generative Pre-trained Transformer 3. As the name suggests,
GPT-3 is the third version of the model released by OpenAI that can generate text using
pre-trained algorithms. All the data it needs to carry out its tasks is provided in advance.
Specifically, the OpenAI team gathered publicly available datasets, including the web crawl
known as “Common Crawl” and the entire text of Wikipedia, and fed GPT-3 with the aggregated
text data. Earlier, I explained in more detail how efficient this system is. If you pose a
question to GPT-3, it will try to give you the most helpful answer. If you ask it to perform
a task such as compiling a summary or writing a poem, it will write a summary or a poem.
More technically speaking, GPT-3 was, at the time of its release, the largest neural network
ever created [4,5,6].

The “neural network” based algorithms used in the new generation of artificial intelligence
that I mentioned before can be described as more efficient versions of the comparison method
in our k-NN algorithm. I think that if we are going to make inferences about the future of
the k-NN algorithm and how it can be improved, we should examine these new generation
learning programs. It is therefore vital to examine GPT-3, which uses the neural network
method and more advanced algorithms but is in the same evolutionary line as k-NN.

Just as we looked at the primitive k-NN to examine the working and potential of today's
advanced programs, we will now focus on the errors of k-NN and how it can be improved by
reviewing it briefly against GPT-3. GPT-3 is highly revolutionary, and if it proves usable
and useful in the long run, it could have enormous implications for the way software and
applications are developed in the future. Because the code itself is not yet publicly
available, selected developers can only access it through an API (Application Programming
Interface) provided by OpenAI. As I said before, the most significant inconvenience for
someone using my k-NN-based algorithm will be the lack of user-friendliness when accessing
the program. Still, I think that self-regulation by the program itself, as in GPT-3, is the
future of programs and, of course, of programs that analyze emotions as in my work. At the
beginning of the article, I mentioned that GPT-3 had read the entire Wikipedia. In
artificial intelligence terminology, this means 3 billion “tokens”. Besides Wikipedia, GPT-3
has “read” two huge databases of books with 12 billion and 55 billion tokens. In this sense,
GPT-3 “knows” encyclopedias, books, and much of what is written on the Internet; the text
collected from the Internet alone is equivalent to about 410 billion tokens. However, for
intelligence, learning is as fundamental as knowing and processing information [4,6].

Human beings process information with about 100 trillion neural connections in our brains
in situations such as the use of complex tools and the production of art, which draw on our
basic skills in communication and presentation. GPT-3 is a massive leap because of the
number of its parameters, which, though fewer than the connections of the human brain,
reaches 175 billion. This is over 100 times the number of parameters of the previous
version, GPT-2. In other words, we can say that GPT-3 processes the roughly 410 billion
tokens it collected with 175 billion connections. These results lead us to think about what
might be possible if the number of connections were the same as ours [4,5].

GPT-3 uses semantic analysis to learn how to construct language structures such as
sentences and, potentially, structures in languages and emojis with visual expression. It
examines words and their meanings, and it develops and updates an understanding of how
words, situations, and combinations differ depending on the other words used in the text.
This method is a form of machine learning called unsupervised learning, because the training
data do not contain any information about the “right” or “wrong” answer, as they do in
supervised learning. Although this method is a more advanced version of the k-NN we used in
this project, the two have similarities. Such programs will likely make mistakes,
potentially many times at first, but they will eventually find the right word. By checking
against the original input data, the program knows when its output is correct and assigns a
“weight” to the process that produced the correct answer. Similar to learning in infants,
this means that it gradually “learns” which methods are most likely to give the correct
answer, drawing on the information it has built up from the right answers. The scale of this
efficient and dynamic “weighting” process is one of the features that made GPT-3 the largest
neural network created up to that time.

So, do these algorithms have any shortcomings? And why are relatively simple algorithms
like k-NN still used today? This is where GPT-3's shortcomings come into play, and k-NN
shines brightly thanks to them:

1) Due to the large amount of computational power required to perform its task, GPT-3 is
very expensive to use. Sadly, I therefore had to cut it from my experiments.

2) The fact that it is not public, and that the program therefore cannot be examined, can
cause trust issues.

3) GPT-3 outputs are not perfect yet. It still makes some mistakes.

We can make ambitious but realistic predictions by looking at GPT-3, the evolutionary
descendant of k-NN, and we can also notice the deficiencies of the k-NN algorithm. From
this, I can say that although it is possible for our learning programs to analyze and
classify emojis or emoji-like symbols with a relatively simple program, in the future this
will be done more efficiently with artificial intelligence such as GPT-3, which is
user-friendly, can update itself efficiently, and is powered by versatile algorithms. Thanks
to the calculations it uses, k-NN can classify a lot of data with relatively practical
methods. Its method of comparison is already at the center of the modern, most advanced
algorithms I mentioned earlier. At the same time, methods should be added to ensure that the
data given to the program is reliable and that its quality is checked, alongside care in the
way the learning algorithm is programmed and designed.

Nowadays, algorithms that interact with people on social media through writing, or chatbots
(friend-bots) that have a personality and can draw inferences from the writing style of the
other person, are gaining more importance than today's standard chatbots. The importance of
understanding the emotions of the human users with whom algorithms interact is increasing.
And as I mentioned before, the most basic and efficient way of doing this is to understand
these feelings through emojis. A situation that would otherwise take various combinations of
words to explain can be described with a single emoji. The use of these features by
algorithms will gain importance in the near future in many different sectors, such as the
gaming industry or personal artificial intelligence assistants.

But such studies do not have to be limited to a relatively current issue such as behaviors
expressed with emoji combinations, or to humans alone. Assuming that we express each
behavior pattern that most animal species show in certain situations with a single symbol,
these programs can be used to understand the logic in the behavior and “communication” of
another species through the symbol combinations that emerge. Or, as with emoji
communication, they can be used to better understand languages that express some situations
with symbols. For example, we can transfer the behavior of any species that lives on our
planet, such as the octopus, which has recently been observed developing unusual social
behaviors, to programs as if the behaviors were symbols. We can use algorithms like the
programs I made to analyze the combinations of these behaviors. In this way, there is
potential to use machines not only to understand our emotions but also to understand the
minds of other animals on our planet. Another potential use case is using these algorithms
to decode some ancient languages that we still cannot fully decipher, such as the
symbol-based Phrygian language. In this way, software that can compare inferences about what
past civilizations used these symbol combinations for, faster than we can, may pave the way
for us to understand these languages better.

It is extraordinary to work with such programs on emojis, on ideographic tools that can be
called “stamps”, or on symbols that only represent emotions [14]. When we think about it,
the first writing is Sumerian or, as the Sumerian people called it, eme-gir. It was
basically a language that evolved from visual representation, and it is the oldest known
written language. From 3000 BC to the present day, visual representation has somehow
continued, and I have to say that it is remarkable and pleasant to design how a program will
learn such symbols and to write about it [15]. I hope the algorithm I made or the articles I
wrote will inspire someone on this subject.

To show the potential of machine learning and the future examples that I discussed in this
project, I also tested algorithms that use 2 different types of deep learning methods. The
purpose of these tests was to demonstrate the potential of deep learning in practice during
my presentation. For the first test, as shown in Figure 4, I used a computer scan of the
mummy of Pharaoh Ramesses II (r‘-ms-sw - Ri‘a-masi-si), who lived from 1303 to 1213 B.C.
This algorithm, which generally works from the lines of the faces of people in old photos,
used the human videos in its data to produce its animations. I used this experiment in my
presentation on deep learning to show how effective deep learning can be and its potential
in future animation projects. In this presentation, after a long time, we had the
opportunity to look at the face of Ramesses II properly. I used the results of this
algorithm, called “Deep Nostalgia”, in my presentation.

For the next test, I chose an example related to the test I made on the GPT-3 algorithm, in
order to support my thesis that once we obtain the personality analyses of fictional
characters and the mathematical ratios of the word combinations they use, a conversational
artificial intelligence built from these data can be largely similar to that character. As
shown in Figure 5, I used a popular Nintendo character named “Krystal” for a visual test. I
used Artbender, an algorithm that produces its works from a dataset containing a variety of
artworks. This test was needed because the character I chose is a blue female fox character
with humanoid features and human anatomy. In other words, it draws attention to the
potential of deep learning and its impact on the future game and animation industry by
showing how the algorithm can mimic such an extreme case, a character that differs even from
a human. Another point I would like to make here is why this test is an extreme example that
I deliberately chose for deep learning. Although the facial features of the character carry
human features to a certain extent, they generally differ from standard human facial
features. Thus, the desired drawing of the character differed from the standard art
movements in general and from the inputs frequently seen in the data set. This made it an
ideal example for testing deep learning and the potential of the algorithm at an extreme
point. I have included the results of this test in my presentation. With these different
studies, I have shown both how deep learning succeeds and how it can make mistakes. I also
drew attention to this during my presentation on deep learning.


© MyHeritage


Figure 5. Results of the Artbender algorithm.

With this test, I came to the end of my research. The tests we prepared give an idea of how
the most basic systems can be monitored, of the potential of deep learning and how it can
evolve in the future, and of how machine learning does and does not analyze people's
emotions. I will make my algorithm public so that it can be observed.


4 ENTITY-RELATION


In this project, nothing was done that required an ER diagram, and there is no need for a
diagram in this part, as the project and research include three different algorithms and
different tests.


5 INTERDISCIPLINARY


Since I did the tests on deep learning in this project, I only used research on these topics
and information about ancient languages.